Machine learning classification of autism spectrum disorder based on reciprocity in naturalistic social interactions

Autism spectrum disorder is characterized by impaired social communication and interaction. Although it is a neurodevelopmental condition typically diagnosed during childhood, diagnosis in adulthood is preceded by a resource-heavy clinical assessment period. Ongoing developments in digital phenotyping give rise to novel opportunities within the screening and diagnostic process. Our aim was to quantify multiple non-verbal social interaction characteristics in autism and build diagnostic classification models independent of clinical ratings. We analyzed videos of naturalistic social interactions in a sample of 28 autistic and 60 non-autistic adults paired in dyads and engaging in two conversational tasks. We used existing open-source computer vision algorithms for objective annotation to extract information based on the synchrony of movement and facial expression. These were subsequently used as features in a support vector machine learning model to predict whether an individual was part of an autistic or non-autistic interaction dyad. The two prediction models based on reciprocal adaptation in facial movements and on individual amounts of head and body motion and facial expressiveness showed the best predictive performance (balanced accuracies: 79.5% and 68.8%, respectively), followed by models based on reciprocal coordination of head (balanced accuracy: 62.1%) and body (balanced accuracy: 56.7%) motion, as well as on intrapersonal coordination processes (balanced accuracy: 44.2%). Combinations of these models did not increase overall predictive performance. Our work highlights the distinctive nature of non-verbal behavior in autism and its utility for digital phenotyping-based classification. Future research needs to explore the performance of different prediction algorithms to reveal underlying mechanisms and interactions, and to investigate the prospective generalizability and robustness of these algorithms in routine clinical care.


S2.1. Facial Expression Synchrony
In order to prevent biases from imprecise facial tracking, we only included participants with a mean confidence of tracked frames above 75%, as well as a percentage of successfully tracked frames above 90%. We were interested in the synchrony of action units rather than emotional facial expressions. According to the Facial Action Coding System (FACS 2), emotional facial expressions are combinations of certain action units. The correspondence between an emotion and a combination of action units is assumed to be bidirectional; for example, happiness is assumed to be associated with the activation of action units 6 and 12, and the activation of these action units is assumed to always convey happiness. This, however, is not appropriate for this study: (1) It is possible that an action unit, though present, is not detected by the automated algorithm used in this study, e.g., due to partial occlusion of the face. In this case, emotional facial expression synchrony would erroneously be coded as non-existent, whereas assessing all action units separately would still capture the synchrony of all correctly detected action units. (2) Inferring emotional states from facial expressions has been shown to be less straightforward than originally assumed 3. Moreover, autistic people have been found to demonstrate even less coherence between their emotional expressions and their underlying emotional states 4.
For the cross-correlation of all facial AUs, the time series were split into windows of seven seconds (in accordance with 5) with lags of two seconds. To prevent information loss at window boundaries 6, we performed the cross-correlations in steps of 4 seconds, returning a matrix of 17,908 cross-correlation values per dyad per task. In line with previous synchrony studies using MEA (e.g., 7,8), all cross-correlation values were Fisher's Z-transformed and converted to absolute values. In time windows where no movement was present in either interactant, the cross-correlation was labelled as missing. To avoid overfitting, in our final machine learning analysis we only included action units for which none of the participants had more than 50% missing data. For the meal-planning task, these were action units 1, 2, 6, 7, 9, 14, 15, 17, 20, 25, 26, and 45. For the hobbies task, we included action units 1, 2, 6, 7, 9, 15, 17, 20, 23, 25, 26, and 45. A description of the relevant action units can be found in Supplementary Table S1.
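For illustration, the following minimal Python sketch shows this windowing scheme (7-second windows, 4-second steps, lags up to 2 seconds, Fisher's Z-transform, absolute values, missing windows without movement). The function name and the assumed sampling rate `fps` are ours, not the original implementation:

```python
import numpy as np

def windowed_cross_lagged_corr(x, y, fps=30, win_s=7, step_s=4, max_lag_s=2):
    """Cross-correlate two AU intensity series in overlapping windows.

    Returns a (n_windows, n_lags) matrix of absolute, Fisher's
    Z-transformed correlations; windows without movement are NaN.
    """
    win, step, max_lag = win_s * fps, step_s * fps, max_lag_s * fps
    rows = []
    for start in range(max_lag, len(x) - win - max_lag, step):
        row = []
        for lag in range(-max_lag, max_lag + 1):
            a = x[start:start + win]
            b = y[start + lag:start + lag + win]
            if np.std(a) == 0 or np.std(b) == 0:   # no movement -> missing
                row.append(np.nan)
            else:
                r = np.corrcoef(a, b)[0, 1]
                row.append(abs(np.arctanh(np.clip(r, -0.999, 0.999))))
        rows.append(row)
    return np.array(rows)
```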

S2.2. Head Movement Synchrony
To account for the dynamic nature of the interaction tasks 6, the extracted motion energy time series for head movement were synchronized in windows of 30 seconds with lags of 5 seconds. An overlap of 15 seconds was chosen in order to capture instances of synchrony between windows, resulting in a cross-correlation matrix of 11,438 values per dyad per task.
As is common practice in the application of MEA, we assessed whether the synchrony scores derived from the extracted motion energy values were above chance. To this end, we shuffled our datasets to create 500 pseudo-dyads, thus pairing the time series of two people who had never actually interacted with each other. Using windowed cross-lagged correlation, we subsequently calculated their interpersonal synchrony in the same manner as for the real dyads, with window sizes of 30 seconds, increments of 15 seconds, and lags of 5 seconds. The resulting cross-correlations were averaged across all windows and lags, resulting in one global synchrony value per pseudo-dyad, and compared to the averages derived from the real dyads using independent Welch t-tests. Pseudo-synchrony in the head ROI (M = .075, SD = .013) was significantly lower than real head synchrony (M = .081, SD = .018), suggesting that the head synchrony found between our participants was above chance.
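A minimal sketch of this pseudo-dyad (surrogate) test, reusing `windowed_cross_lagged_corr` from the sketch above; the toy time series and dyad indices are illustrative stand-ins for the real motion energy data:

```python
import itertools
import random
import numpy as np
from scipy.stats import ttest_ind

# toy stand-ins for the real motion energy time series (one per participant)
rng = np.random.default_rng(0)
time_series = [rng.random(3000) for _ in range(12)]
real_pairs = [(0, 1), (2, 3), (4, 5), (6, 7), (8, 9), (10, 11)]

def global_synchrony(i, j):
    """Mean absolute windowed cross-lagged correlation of two series."""
    m = windowed_cross_lagged_corr(time_series[i], time_series[j],
                                   win_s=30, step_s=15, max_lag_s=5)
    return np.nanmean(m)

# pair people who never actually interacted with each other
candidates = [p for p in itertools.combinations(range(len(time_series)), 2)
              if p not in real_pairs]
pseudo_pairs = random.sample(candidates, k=min(500, len(candidates)))

real = [global_synchrony(i, j) for i, j in real_pairs]
pseudo = [global_synchrony(i, j) for i, j in pseudo_pairs]
t, p = ttest_ind(real, pseudo, equal_var=False)   # Welch's t-test
```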

S2.3. Body Movement Synchrony
Body motion energy time series were processed and synchronized in the same manner as head motion energy. For the comparison to pseudo-synchrony, we found a group difference suggesting above-chance body synchrony in our interactional dyads (pseudo: M = .087, SD = .016 vs. real: M = .089, SD = .018). However, contrary to our hypothesis, this difference was not significant.

S2.4. Derivation of individual feature vectors and peak-picking -from shared interpersonal synchrony to individual adaptation
One way to establish interpersonal synchrony is through the adaptation of one person to another.
All resulting interpersonal synchrony cross-correlation matrices (facial expression, head motion, and body motion) were split according to the direction of the lag (Supplementary Figure S1). This allowed for quantification of the degree of adaptation of every participant within their dyad 9. Subsequently, a peak-picking algorithm 10,11 was used to extract the maximum adaptation per time window in every task. Summary statistics (mean, median, standard deviation, minimum, maximum, skewness, and kurtosis) of these peak values per task constituted the final feature set.
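A minimal sketch of the lag-splitting and peak-picking step, assuming the lag convention of Supplementary Figure S1 (columns left of the zero lag reflect person A adapting to B, columns right of it B adapting to A); all names are illustrative:

```python
import numpy as np
from scipy.stats import skew, kurtosis

def adaptation_features(sync_matrix, n_lags_per_side):
    """sync_matrix: (n_windows, 2 * n_lags_per_side + 1) cross-correlations.

    Picks the peak adaptation per time window for each direction of the lag
    and summarizes the peaks into one feature vector per person and task.
    """
    peaks_a = np.nanmax(sync_matrix[:, :n_lags_per_side], axis=1)
    peaks_b = np.nanmax(sync_matrix[:, n_lags_per_side + 1:], axis=1)

    def summarize(p):
        p = p[~np.isnan(p)]
        return {"mean": p.mean(), "md": np.median(p), "sd": p.std(),
                "min": p.min(), "max": p.max(),
                "skew": skew(p), "kurtosis": kurtosis(p)}

    return summarize(peaks_a), summarize(peaks_b)
```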

S2.5. Head-body coordination
For the purpose of quantifying head and body integration, a three-dimensional head motion vector was derived from the three movement axes (pitch, yaw, and roll) using the following formula:

$$HM_t = \sqrt{\Delta_{t,x}^2 + \Delta_{t,y}^2 + \Delta_{t,z}^2}$$

where $\Delta_{t,x}$ denotes the frame-to-frame difference vector in head movement on the x-axis with respect to the camera, $\Delta_{t,y}$ the frame-to-frame difference vector in head movement on the y-axis with respect to the camera, and $\Delta_{t,z}$ the frame-to-frame difference vector in head movement towards and away from the camera. The resulting head movement vector was subsequently cross-correlated with the body motion time series derived from MEA. A window size of 30 seconds with lags of 5 seconds and a step size of 15 seconds was chosen in conformity with the calculation of interpersonal movement synchrony. Peak synchrony instances of every time window were extracted, with their summary statistics (mean, median, minimum, maximum, skewness, and kurtosis) per task constituting the final feature set for intrapersonal movement coordination.
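A short sketch of the head motion vector computation from the reconstructed formula above. The OpenFace head pose column names (`pose_Rx`, `pose_Ry`, `pose_Rz`) are an assumption based on its documented output format, not a detail stated in the text:

```python
import numpy as np
import pandas as pd

def head_motion_magnitude(df: pd.DataFrame) -> pd.Series:
    """Per-frame Euclidean norm of frame-to-frame head pose differences."""
    deltas = df[["pose_Rx", "pose_Ry", "pose_Rz"]].diff().dropna()
    return np.sqrt((deltas ** 2).sum(axis=1))
```

The resulting one-dimensional series can then be fed into the same windowed cross-lagged correlation used for the interpersonal synchrony measures.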

S2.6. Total Movement and Facial Expressiveness
We aimed to quantify both the relative amount of head and body movement and the overall facial expressiveness of the individual participants during our testing sessions. Movement quantity in the head and body ROIs was derived from the respective MEA time series of the two tasks.
Following procedures from previous MEA publications 12,13, movement quantity was defined as the number of frames with changes in motion energy divided by the total number of frames, resulting in four values per participant (two ROIs × two tasks). Facial expressiveness was operationalized as the mean intensity of all action units derived from OpenFace 2.0 14 that were included in our facial expression classifier (AU time series with <50% missing values) per task, resulting in two features per participant.
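An illustrative computation of these two feature types. We interpret "frames with changes in motion energy" as frames with non-zero motion energy, which is an assumption; the AU column names follow the OpenFace intensity naming convention:

```python
import numpy as np
import pandas as pd

def movement_quantity(motion_energy) -> float:
    """Share of frames with non-zero motion energy (one value per ROI and task)."""
    motion_energy = np.asarray(motion_energy)
    return float((motion_energy > 0).mean())

def facial_expressiveness(df: pd.DataFrame, included_aus) -> float:
    """Mean intensity over the AU columns retained in the facial classifier
    (those with <50% missing values), e.g. ["AU01_r", "AU02_r", "AU06_r"]."""
    return float(df[list(included_aus)].mean().mean())
```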

S3.1. Machine Learning Preprocessing Pipeline for base models
In a first step, all base models underwent pruning of uninformative features (features with zero variance). Subsequently, all features were scaled from 0 to 1 to remove potential effects of scale differences. Due to the relatively large number of features, further pre-processing was conducted for the base models of facial expression synchrony and head movement synchrony.
The base models of facial expression synchrony and head movement synchrony underwent the following additional processing: (1) To reduce the dimensionality, features were pre-processed using principal component analysis (PCA), retaining the principal components that explained 80% of the variance in each CV1 fold 15 .
These represent the default parameter settings in Neurominer 15. An ensemble of the top 50% of models was created for each base learner and subsequently applied to the outer CV2 data to produce a single, robust average prediction.
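A condensed scikit-learn sketch of the base-model preprocessing within each CV1 fold (zero-variance pruning, 0-to-1 scaling, and, for the facial expression and head movement synchrony models, PCA retaining 80% of the variance). This mirrors the described Neurominer defaults in spirit only and is not the original implementation:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pipeline = Pipeline([
    ("prune", VarianceThreshold(threshold=0.0)),  # drop zero-variance features
    ("scale", MinMaxScaler()),                    # scale features to [0, 1]
    ("pca", PCA(n_components=0.80)),              # keep 80% explained variance
    ("svm", SVC(kernel="linear")),                # linear SVM base learner
])
```

Fitting this pipeline inside each training fold keeps the scaling and PCA parameters from leaking information into the held-out data.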

S3.2. Machine Learning Preprocessing Pipeline for Stacking Models
We trained two different stacking models to investigate whether a combination of the modalities in the head region, as well as a combination of all base classifiers, could further improve prediction accuracy. The stacking models combined the decision scores of the respective base classifiers (FACEsync + HEADsync; FACEsync + HEADsync + BODYsync + INTRAsync + MoveEx) within each CV1 partition, standardizing the resulting matrices and subsequently using them as new sets of predictive features, which replaced the original features in each CV1 partition.
Subsequently, the CV2 validation predictions of the previously trained base classifiers' SVM ensembles were combined, standardized using the median, and winsorized within 3 standard deviations to their closest percentile. Then, each SVM ensemble was applied to this standardized CV2 decision score matrix. Majority voting was used to obtain the final class prediction.
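A rough sketch of the stacking step. Names are illustrative, the CV bookkeeping is omitted, and clipping at 3 SD is used as a simple approximation of the closest-percentile winsorization described above:

```python
import numpy as np

def standardize_decision_scores(scores: np.ndarray, n_sd: float = 3.0) -> np.ndarray:
    """Median-center and scale base-classifier decision scores
    (subjects x base models), then winsorize values beyond 3 SD."""
    z = (scores - np.median(scores, axis=0)) / scores.std(axis=0)
    return np.clip(z, -n_sd, n_sd)

def majority_vote(ensemble_predictions: np.ndarray) -> np.ndarray:
    """ensemble_predictions: (n_models, n_subjects) array of 0/1 labels."""
    return (ensemble_predictions.mean(axis=0) > 0.5).astype(int)
```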

S3.3. Permutation testing description
We employed permutation testing to assess whether our base prediction models were statistically significant 16. To this end, we performed 1000 random permutations of the outcome labels (ASD-TD and TD-TD). All linear SVM models were retrained for each permutation in the same stratified, repeated, nested double CV structure using the respective feature subsets obtained from the observed-label analyses. Subsequently, we accumulated the predictions of the random models for each permutation into a permuted ensemble prediction for each outer-cycle subject.
Thus, we built a null distribution of out-of-training classification performance (BAC) for every base classifier. We then calculated the significance of the observed out-of-training accuracy as the number of events where the permuted out-of-training accuracy was higher than or equal to the observed BAC, divided by the number of permutations performed. The significance of each model was determined at α = 0.05, FDR corrected. We performed two separate permutation analyses for each model, the latter of which is reported in the main manuscript: (a) with all labels of all participants randomly permuted, and (b), taking into account the dyadic structure of the data, with labels randomly permuted while ensuring that both participants of each dyad were always permuted together. For two models (namely BODYsync and INTRAsync) this resulted in slight changes in the models' exact p-values (BODYsync: (a) p = .011, (b) p = .009; INTRAsync: (a) p = .999, (b) p = .994).
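A minimal sketch of the permutation p-value computation. Here `fit_and_score` is a hypothetical stand-in for the full nested-CV retraining that returns a balanced accuracy for a given labeling:

```python
import numpy as np

def permutation_p_value(fit_and_score, X, y, observed_bac,
                        n_perm=1000, seed=0):
    """Share of permuted balanced accuracies at or above the observed BAC."""
    rng = np.random.default_rng(seed)
    permuted_bacs = np.array([
        fit_and_score(X, rng.permutation(y)) for _ in range(n_perm)
    ])
    return (permuted_bacs >= observed_bac).sum() / n_perm
```

For the dyad-aware variant (b), the permutation would be applied to dyad indices rather than individual labels, so that both members of a dyad always receive the same permuted label.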

S3.4. Feature visualization of facial expression features
To determine the influence of the different features on the predictive BAC at the individual level, features were visualized for the best-performing model (FACEsync). This included the calculation of the weights of the individual features, the cross-validation ratio, and the sign-based consistency. The feature weights (Supplementary Figure S2) calculated by Neurominer were defined as the median weights of the selected CV1 models for each CV2 fold, divided by the number of CV2 folds 15. The cross-validation ratio, a measure of stability, was defined as the sum across CV2 folds of the CV1 median weights divided by their respective CV1 standard errors, which was subsequently divided by the number of CV2 folds 17. The sign-based consistency 18 was defined as the number of times that the sign of each feature weight (positive or negative) was consistent within an ensemble, multiplied by the number of times that the feature was non-zero, and was calculated according to the following procedure 15: the measure ranges from 0 to 1, with 1 representing perfect consistency within the ensemble and 0 when the weights are equally often positive and negative or when the feature is omitted with a zero weight. A p-value was then calculated by defining a hypothesis test for the importance score with a null hypothesis of 0. A z-score was calculated as the importance divided by the square root of the variance of the importance scores. A standard p-value was then calculated using the normal cumulative distribution function to obtain the right-tailed significance. P-values were corrected using the false-discovery rate.
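A hedged sketch of the sign-based consistency measure as described above; the exact Neurominer internals may differ, and the variance term in particular is our reading of the procedure:

```python
import numpy as np
from scipy.stats import norm, false_discovery_control

def sign_based_consistency(weights: np.ndarray):
    """weights: (n_models, n_features) weight matrix of one SVM ensemble."""
    signs = np.sign(weights)                  # -1, 0, +1 per weight
    agreement = np.abs(signs.mean(axis=0))    # 1 = same sign everywhere, 0 = split
    nonzero = (weights != 0).mean(axis=0)     # how often the feature is non-zero
    importance = agreement * nonzero          # in [0, 1]
    var = signs.var(axis=0)                   # per-feature sign variance
    z = importance / np.sqrt(np.where(var > 0, var, 1.0))
    p = 1.0 - norm.cdf(z)                     # right-tailed significance
    return importance, false_discovery_control(p)  # FDR-corrected p-values
```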
Supplementary Figure S2. Feature weights for the FACEsync model.

S4. Supplementary Results
We conducted a range of exploratory analyses to further characterize our sample with standardized clinical self-ratings (S4.1). Moreover, we aimed to explore the impact of our experimental setup on the naturalistic interactional behavior of our participants (S4.2 and S4.3).
All participants underwent a facial expression recognition classification task, whose group-based results can be found under S4.4. To further investigate potential underlying factors driving misclassification within our base models, we conducted additional correlational analyses (S4.5). We also explored how our interactional setup performed in a series of machine learning classification models based on individual diagnosis (S4.6). Lastly, in an exploratory analysis, we conducted a classification using a Random Forest algorithm for our five base models and the two stacking models (S4.7).

S4.1. Clinical characteristics of correctly vs. incorrectly classified autistic participants
We assessed the clinical characteristics of our sample using a range of neuropsychological self-rating questionnaires. Results based on a group comparison between patients and control participants can be found in Supplementary Table S2.
Further, we were interested in whether there were any notable differences in the clinical characteristics of those autistic participants within our ASD-TD interactions who were classified correctly as belonging to a mixed dyad (true positive, TP) vs. incorrectly as belonging to a non-autistic control dyad (false negative, FN). For this, we ran a series of two-sided Welch independent-samples t-tests within the autistic subsample for all base classification models. Supplementary Figure S3 depicts the group differences for the FACEsync model. While TP autistic participants on average had lower alexithymia scores (M = 57.91) than FN autistic participants (M = 86.60), this difference did not survive FDR correction (p = .07). We found no significant differences between autistic participants for the other models (see Supplementary Tables S7-10).

S4.2. Camera influence
Since we aimed to capture naturalistic social interactions, we assessed the degree to which participants felt influenced by the presence of the cameras. Participants rated the degree of influence on a 4-point scale ranging from 0 (= not at all) and 1 (= a little) to 2 (= considerably) and 3 (= very much). Ratings can be found in Supplementary Figure S4.
Supplementary Figure S4.Perceived influence of camera on interaction.

S4.3. Impact of COVID-19 measures on interaction
Shortly after the beginning of data collection (the first nine control dyads), the increased hygiene and safety measures necessitated by the COVID-19 pandemic required slight changes to the setup.
Among these changes were a different testing room, as well as the installation of a transparent plastic screen between the participants to reduce the risk of airborne infection. In an effort to reduce any distracting effects of mirror images on the screen, a transparent anti-reflection foil was applied to it. To rule out any biases of these hygiene measures on the features used for classification, we conducted two-sample Welch t-tests between the features of the TD-TD dyads before and after the change in setup (Supplementary Table S3). Additionally, after each testing session we asked participants to rate the perceived impairment caused by the plexiglass screen, as well as the perceived rapport with their conversational partner (Supplementary Figure S5). Analyzing group differences in perceived rapport between the TD-TD dyads did not reveal statistically significant differences before and after the installation of the screen (Supplementary Table S3).

S4.7. Random Forest Classification
While the SVM is a commonly used machine learning algorithm for classification problems, other algorithms exist that tend to perform well on small datasets. For instance, Random Forest is a supervised machine learning algorithm that aggregates the predictions of a large number of independent decision trees into a final classification 21. Using the same preprocessing pipelines and repeated nested cross-validation structure, we retrained our five base and two stacking models with a Random Forest classifier in Neurominer. Detailed results can be found in Supplementary Table S14.
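An illustrative scikit-learn counterpart to this retraining step. The original analysis used Neurominer with repeated nested CV, so this is only a sketch of the idea on a toy dataset, with hyperparameters left close to defaults:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# toy stand-in for the preprocessed feature matrix and dyad-type labels
X, y = make_classification(n_samples=88, n_features=20, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0)
scores = cross_val_score(rf, X, y, cv=5, scoring="balanced_accuracy")
print(f"Mean balanced accuracy: {scores.mean():.3f}")
```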

S5. Supplementary tables
Supplementary Table S1. Action Units included in this study as extracted by OpenFace.
Action Unit | Definition according to FACS (table contents omitted)
Note. The SVM models were trained on a total of 88 participants, with n = 56 participants belonging to a mixed dyad (ASD-TD) vs. n = 32 participants belonging to a non-autistic control dyad (TD-TD). min = minimum, max = maximum, sd = standard deviation, skew = skewness, md = median, AU = action unit.
Supplementary Figure S3. Clinical characteristics of correctly vs. incorrectly classified autistic participants in the FACEsync model. Note. ADC = Adult Dyspraxia Checklist, AQ = Autism Quotient, BDI = Beck Depression Inventory, CFT = Culture Fair Test 20-R, MWT = Mehrfach-Wortschatz-Test, SMS = Self-Monitoring Scale, SPF = Saarbrücker Persönlichkeitsfragebogen, TAS-20 = Toronto Alexithymia Scale. The ADC is composed of a section covering movement difficulties in childhood and one covering adulthood; both scores are depicted in this figure. Bars represent inter-quartile ranges.

Supplementary Table S2. Clinical self-ratings of autistic and control participants. Note. Mean parameter values (SD in parentheses) for each of the questionnaires are shown for the ASD and TD participants, as well as the results of Wilcoxon tests (assuming unequal variance). BDI includes one missing value for TD. AQ = Autism Quotient. SPF = Saarbrücker Persönlichkeitsfragebogen.

Supplementary Table S3. Mean group differences in perceived rapport before and after the Plexiglas setup for TD-TD dyads (n = 16 per group).

Supplementary Table S4. Pearson correlation coefficients between the decision scores of the different base models for the participants from the ASD-TD dyad type. Note. M and SD are used to represent mean and standard deviation, respectively. p-values are FDR corrected. *** indicates p < .001.

Supplementary Table S5. Note. M and SD are used to represent mean and standard deviation, respectively. p-values are FDR corrected. * indicates p < .05.

Supplementary Table S6. Additional classification metrics for the ASD-TD vs. TD-TD SVM models.

Supplementary Table S7. FACEsync model: Comparison analyses of clinical variables between correctly classified and misclassified ASD participants.

Supplementary Table S8. HEADsync model: Comparison analyses of clinical variables between correctly classified and misclassified ASD participants.

Supplementary Table S9. BODYsync model: Comparison analyses of clinical variables between correctly classified and misclassified ASD participants.

Supplementary Table S10. INTRAsync model: Comparison analyses of clinical variables between correctly classified and misclassified ASD participants.

Supplementary Table S11. MovEx model: Comparison analyses of clinical variables between correctly classified and misclassified ASD participants.